In this project, a wine quality dataset is used to study the effect of different physicochemical parameters on the quality of red and white wines. The objective is to determine the top three predictors of quality for each wine type. Because this is a predictive question, the study uses a machine learning method to predict the wine quality (the target variable) from the wine features. A decision tree classifier was used for this purpose, and the top three predictors were determined from this classifier.
The wine quality data was obtained from the UCI Machine Learning Repository [1]; the original data was prepared by P. Cortez [2]. The data is divided into two datasets, one for the red and one for the white variant of the Portuguese "Vinho Verde" wine. The most common physicochemical (feature) and sensory (target) variables are available in both datasets. Each dataset has 12 variables (11 features and the quality score), with 1599 red and 4898 white examples in total [3]. The features are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol. The target variable is the wine quality, defined as a numerical score between 0 (the worst wine) and 10 (the best wine). However, these datasets contain no observation with a quality lower than 3 or higher than 9. Table 1 shows the summary of feature statistics for each dataset.
Table 1. The summary of feature statistics for each dataset
| Feature | Red wine (min) | Red wine (mean) | Red wine (max) | White wine (min) | White wine (mean) | White wine (max) |
|---|---|---|---|---|---|---|
| Fixed acidity | 4.60 | 8.32 | 15.90 | 3.80 | 6.85 | 14.20 |
| Volatile acidity | 0.12 | 0.53 | 1.58 | 0.08 | 0.28 | 1.10 |
| Citric acid | 0.00 | 0.27 | 1.00 | 0.00 | 0.33 | 1.66 |
| Residual sugar | 0.90 | 2.54 | 15.50 | 0.60 | 6.39 | 65.80 |
| Chlorides | 0.01 | 0.09 | 0.61 | 0.01 | 0.05 | 0.35 |
| Free sulfur dioxide | 1.00 | 15.87 | 72.00 | 2.00 | 35.31 | 289.00 |
| Total sulfur dioxide | 6.00 | 46.47 | 289.00 | 9.0 | 138.4 | 440.0 |
| Density | 0.990 | 0.997 | 1.004 | 0.987 | 0.994 | 1.039 |
| pH | 2.74 | 3.31 | 4.01 | 2.72 | 3.19 | 3.82 |
| Sulphates | 0.33 | 0.66 | 2.00 | 0.22 | 0.49 | 1.08 |
| Alcohol | 8.40 | 10.42 | 14.90 | 8.00 | 10.51 | 14.20 |
| Quality | 3.00 | 5.64 | 8.00 | 3.00 | 5.88 | 9.00 |
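A summary like Table 1 can be produced with a few lines of pandas. The following is only a minimal sketch: the `summarize` helper and the tiny `sample` frame are hypothetical stand-ins, and the file name and separator mentioned in the comment are assumptions about how the data would be loaded.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return min, mean and max of every column, one row per feature."""
    return df.agg(["min", "mean", "max"]).T.round(3)


# Made-up rows standing in for the real data (values are illustrative only).
# With the real file one would load it first, for example with
# pd.read_csv("winequality-red.csv", sep=";")  (name and separator assumed).
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 12.0],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

summary = summarize(sample)
print(summary)
```

Calling the same helper once per dataset and placing the two results side by side would give the red and white column groups of the table.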
The datasets have been checked to make sure that they contain no missing values. The classes in both the red and white wine datasets are not balanced, because there are many more normal wines than good or bad ones. Figure 1 shows the bar plot of the number of observations for each wine quality in the red wine dataset. Figure 2 shows a similar bar plot for the white wine dataset (the number of observations for each quality is given on top of the bars).
Figure 1. The bar plot of the number of observations for different wine qualities in the red wine dataset
Figure 2. The bar plot of the number of observations for different wine qualities in the white wine dataset
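The two checks described above (missing values and class balance) can be sketched as follows. The `wine` frame holds made-up values and is only a placeholder for one of the real datasets.

```python
import pandas as pd

# Made-up rows standing in for one of the wine datasets.
wine = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 12.0, 11.2, 9.0],
    "quality": [5, 5, 6, 7, 6, 5],
})

# Total number of missing cells; 0 means the table is complete.
n_missing = int(wine.isna().sum().sum())

# Observations per quality score, sorted by score (the data behind a bar plot).
class_counts = wine["quality"].value_counts().sort_index()

print(n_missing)
print(class_counts)
```

The `class_counts` series is what a bar plot such as Figures 1 and 2 would display.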
To mitigate this imbalance, some values of the quality variable have been combined and the resulting values have been turned into categorical values. The goal was to merge the quality scores that have a very low number of observations. Two different patterns have been tried for this purpose. Figures 3 and 4 show how this transformation has been done for the first pattern.
Figure 3. The bar plot of the number of observations for different wine qualities in the cleaned red wine dataset (pattern 1)
Figure 4. The bar plot of the number of observations for different wine qualities in the cleaned white wine dataset (pattern 1)
Figures 5 and 6 show how this transformation has been done in the second pattern.
Figure 5. The bar plot of the number of observations for different wine qualities in the cleaned red wine dataset (pattern 2)
Figure 6. The bar plot of the number of observations for different wine qualities in the cleaned white wine dataset (pattern 2)
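The exact cut points of the two patterns are not listed in the text, so the following sketch assumes them: pattern 1 is taken to merge only the sparse tails while keeping scores 5 and 6 separate, and pattern 2 is taken to merge everything into low, med and high. The bin edges and labels are assumptions for illustration, not the study's definition.

```python
import pandas as pd

# Example quality scores covering the observed range 3..9.
quality = pd.Series([3, 4, 5, 5, 6, 6, 7, 8, 9], name="quality")

# Assumed pattern 1: merge only the sparse tails, keep 5 and 6 separate.
# Bins are right-inclusive: (0, 4], (4, 5], (5, 6], (6, 10].
pattern1 = pd.cut(quality, bins=[0, 4, 5, 6, 10], labels=["<=4", "5", "6", ">=7"])

# Assumed pattern 2: three classes, with 5 and 6 merged into "med".
pattern2 = pd.cut(quality, bins=[0, 4, 6, 10], labels=["low", "med", "high"])

print(pd.DataFrame({"quality": quality, "pattern1": pattern1, "pattern2": pattern2}))
```

Using `pd.cut` returns an ordered categorical, which keeps the classes in a sensible order in later plots.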
The effect of each wine feature on the quality has been studied using exploratory data visualization of the cleaned datasets. Figure 7 shows the violin and jitter plots of each feature versus red wine quality in the cleaned data (pattern 1), with error bars. Figure 8 shows a similar plot for the white wine in the cleaned data (pattern 1).
Figure 7. The violin and jitter plot of each feature versus red wine quality (with error bars)
Figure 8. The violin and jitter plot of each feature versus white wine quality (with error bars)
These figures suggest that alcohol is probably the most important predictor for both red and white wines, since the difference between the mean alcohol values of different quality classes is larger than the standard error in both cases. Figures 9 and 10 show the violin and jitter plots of each feature versus the wine quality in the cleaned data (pattern 2) for red and white wines.
Figure 9. The violin and jitter plot of each feature versus red wine quality (with error bars)
Figure 10. The violin and jitter plot of each feature versus white wine quality (with error bars)
In these figures alcohol again looks like an important feature; however, for the low and med classes the error bars overlap for both red and white wines. On the other hand, sulphates looks slightly better than alcohol for the red wine.
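The comparison of class means against their standard errors can also be made numerically. The sketch below uses made-up alcohol values, and the "gap larger than twice the combined standard error" rule is only an illustrative heuristic chosen here, not a criterion taken from the study.

```python
import pandas as pd

# Made-up alcohol values for two quality classes.
wine = pd.DataFrame({
    "quality": ["low", "low", "low", "high", "high", "high"],
    "alcohol": [9.0, 9.5, 10.0, 11.5, 12.0, 12.5],
})

# Mean and standard error of the mean for each class.
stats = wine.groupby("quality")["alcohol"].agg(["mean", "sem"])

# Distance between the class means, and the standard error of that difference.
gap = abs(stats.loc["high", "mean"] - stats.loc["low", "mean"])
combined_se = (stats.loc["high", "sem"] ** 2 + stats.loc["low", "sem"] ** 2) ** 0.5

# Rough rule of thumb (an assumption): the classes are "separated" when the
# gap exceeds about two combined standard errors.
separated = gap > 2 * combined_se

print(stats)
print(gap, combined_se, separated)
```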
In this study, the top three features that affect the wine quality should be determined for both the red and white wines, and a decision tree classifier is used for this purpose. A decision tree is a classifier that uses a tree-like model of decisions and their possible outcomes. What makes it useful for this study is that it chooses the attributes one at a time according to some criterion, trying to find the best attribute at each step, so it makes it possible to rank the predictive features and choose the best ones [4]. Since this study only looks for the top three predictors, the tree depth is set to 3, and this hyperparameter is not subjected to optimization. However, a 10-fold cross validation with a 90:10 ratio (train: validation) was used to calculate the accuracy of the classifier on the validation sets. Both patterns 1 and 2 have been analyzed for the red and white wines using the decision tree classifier. In addition, the scikit-learn class_weight parameter has been used to balance the observations. It can automatically adjust the class weights to be inversely proportional to the class frequencies in the input data [5]. The analysis has been done with and without balancing the datasets.
Figures 11 and 12 show the decision trees for the red and white wines in pattern 1 without balancing. Figures 13 to 18 show the trees for the remaining combinations of cleaning pattern and balancing.
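The setup described above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the wine data, and the stratified, shuffled folds and the fixed `random_state` values are assumptions made here for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced 3-class data with 11 features (a stand-in for the wines).
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=4, n_classes=3,
    weights=[0.1, 0.8, 0.1], random_state=0,
)

# 10 folds: each round trains on 90% of the data and validates on 10%.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

mean_accuracy = {}
for label, class_weight in [("unbalanced", None), ("balanced", "balanced")]:
    # Depth is fixed at 3, as in the study; it is not tuned.
    tree = DecisionTreeClassifier(max_depth=3, class_weight=class_weight, random_state=0)
    scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[label] = scores.mean()

print(mean_accuracy)
```

Reusing the same `cv` object for both runs means the balanced and unbalanced trees are scored on identical folds, so their mean accuracies are directly comparable.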
Figure 11. The decision tree for the red wine (pattern 1) without balancing
Figure 12. The decision tree for the white wine (pattern 1) without balancing
Figure 13. The decision tree for the red wine (pattern 1) with balancing
Figure 14. The decision tree for the white wine (pattern 1) with balancing
Figure 15. The decision tree for the red wine (pattern 2) without balancing
Figure 16. The decision tree for the white wine (pattern 2) without balancing
Figure 17. The decision tree for the red wine (pattern 2) with balancing
Figure 18. The decision tree for the white wine (pattern 2) with balancing
Table 2 summarizes the results of the previous figures. It shows the top three predictors for each tree. To determine the top three predictors, the 'feature_importances_' attribute of the fitted DecisionTreeClassifier in scikit-learn has been used. It returns the Gini importance of each feature. In addition, the table shows the average accuracy of a 10-fold cross validation with a 90:10 ratio for the train: validation sets.
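Ranking the features by Gini importance takes only a few lines. The sketch below fits a depth-3 tree on synthetic data; the feature names are attached to the synthetic columns purely for illustration, so the printed ranking says nothing about real wines.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

feature_names = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# Synthetic stand-in data with one column per feature name.
X, y = make_classification(
    n_samples=500, n_features=11, n_informative=4, n_classes=3, random_state=0,
)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Gini importances are normalized so that they sum to 1 across all features.
importances = tree.feature_importances_

# Indices of the three largest importances, in descending order.
top3_idx = np.argsort(importances)[::-1][:3]
top3 = [(feature_names[i], float(importances[i])) for i in top3_idx]

for name, value in top3:
    print(f"{name}: {value:.3f}")
```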
Table 2. A summary of the decision tree analysis
| Cleaning pattern | Wine type | Balancing used in analysis | Accuracy | First top predictor | Second top predictor | Third top predictor |
|---|---|---|---|---|---|---|
| 1 | Red | No | 54% | Alcohol | Sulphates | Total sulfur dioxide |
| 1 | Red | Yes | 47% | Alcohol | Sulphates | Volatile acidity |
| 1 | White | No | 53% | Alcohol | Volatile acidity | Free sulfur dioxide |
| 1 | White | Yes | 46% | Alcohol | Free sulfur dioxide | Volatile acidity |
| 2 | Red | No | 84% | Alcohol | Volatile acidity | Sulphates |
| 2 | Red | Yes | 53% | Sulphates | Alcohol | Volatile acidity |
| 2 | White | No | 75% | Alcohol | Volatile acidity | pH |
| 2 | White | Yes | 57% | Alcohol | Free sulfur dioxide | Volatile acidity |
The first thing that can be concluded from this table is that pattern 2 gives higher accuracies than pattern 1. This can be attributed to the way the quality scores were combined. The quality of a wine is assessed and scored by a human taster. In pattern 2 we have low, med and high quality wines. However, in pattern 1 we have two large classes of average wines with the scores of 5 and 6. Distinguishing between a very good wine, a very bad wine, and an average wine is much easier than distinguishing between two average wines, and it is likely that the human tasters do not have strict rules to distinguish between these two scores (5 and 6) and simply pick one of them. It can therefore be very difficult for the learning algorithm to separate these two classes, which lowers the accuracy. This suggests that pattern 2 is a better way to clean the dataset, and that it is not a good idea to use a learning algorithm to classify the middle classes based on the scores of a human expert.
It is important to note that in only one case (pattern 1, red wine with balancing) was there a mismatch between the tree structure and the reported important features. In this case, alcohol has the highest Gini importance reported by scikit-learn; however, in the tree structure (Figure 13), sulphates sits at the root of the tree. The reason is that alcohol is used three times to split nodes in the tree, so its total importance is higher than that of sulphates.
In addition, when balancing is done, patterns 1 and 2 give the same set of top three predictors for each wine type. For the white wine the order is also identical, while for the red wine alcohol and sulphates swap the first two positions.
Balancing the dataset in scikit-learn decreases the accuracy. This can be attributed to the fact that, without proper weighting of the less abundant target categories, the algorithm can obtain a higher score by labeling most of the examples with the more abundant target categories and ignoring the less abundant ones at a low cost.
Finally, we pick the top three predictors from the decision tree analysis of pattern 2 with balancing. For red wines, they are sulphates, alcohol, and volatile acidity, in that order. For white wines, they are alcohol, free sulfur dioxide, and volatile acidity. These results are consistent with the exploratory data analysis results mentioned before. Alcohol and volatile acidity are therefore the important features common to both red and white wines.
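The way a feature accumulates importance over several splits can be checked by recomputing the Gini importance from the fitted tree. The following sketch uses synthetic data and sums, for each feature, the weighted impurity decrease of every node that splits on it; it is an illustration of the calculation, assuming the attributes of scikit-learn's `tree_` object used below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_classes=3, random_state=1,
)
clf = DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=0).fit(X, y)
t = clf.tree_

manual = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no contribution
        continue
    # Weighted impurity of the node minus that of its two children.
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
    # Every split on the same feature adds to that feature's total.
    manual[t.feature[node]] += decrease
manual /= manual.sum()  # normalize so the importances sum to 1

root_feature = int(t.feature[0])
most_important = int(np.argmax(clf.feature_importances_))
print(root_feature, most_important)
print(np.allclose(manual, clf.feature_importances_))
```

Because the totals are summed over all nodes, `root_feature` and `most_important` need not coincide, which is the situation described for Figure 13.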
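A small worked example makes this effect concrete. The labels below are made up (an 80/10/10 split is assumed for illustration): a predictor that always answers with the majority class reaches a high plain accuracy but a poor balanced accuracy, and the "balanced" class weights show how scikit-learn up-weights the rare classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 10 low (0), 80 med (1), 10 high (2).
y_true = np.array([0] * 10 + [1] * 80 + [2] * 10)

# A trivial predictor that always answers with the majority class.
y_majority = np.full_like(y_true, 1)

acc = accuracy_score(y_true, y_majority)               # share of correct labels
bal_acc = balanced_accuracy_score(y_true, y_majority)  # mean of per-class recall

# "balanced" weights follow n_samples / (n_classes * class_count).
classes = np.unique(y_true)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_true)

print(acc, bal_acc)
print(dict(zip(classes.tolist(), weights.tolist())))
```

Here the plain accuracy is 0.8 while the balanced accuracy is 1/3, so a model that is pushed to respect the rare classes can easily score lower on plain accuracy.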
[1] Wine Quality Data Set, UCI Machine Learning Repository.
[2] P. Cortez, University of Minho, Guimarães, Portugal (http://www3.dsi.uminho.pt/pcortez).
[3] P. Cortez et al., "Modeling wine preferences by data mining from physicochemical properties", Decision Support Systems 47 (2009) 547-553.
[4] M. Kubat, An Introduction to Machine Learning, Springer International Publishing, 2015.
[5] Scikit-learn documentation.